Hello everybody and welcome to the video nugget on language models.
So remember, a formal language is a set of strings, and for formal languages like Java or C++ it is easy to decide what is a valid Java or C++ string: we just apply the language grammar, and if it says yes, then it is, and if the grammar says no, then it isn't. So it's a decidable problem. For natural languages like English, German, Spanish, Chinese, and so on, this is not the case. So let's look at a concrete example in English: something like "Not to be invited is sad" is definitely English, whereas "To not be invited is sad" is slightly controversial. So the idea is that instead of having a formal language, where we have a clear zero-one criterion, if you will, for whether something is a string of the language, for natural languages it's actually better to use a probability distribution, which would assign something near one to the first example and significantly less than one to the second example. Such a probability distribution we'll call a language model. And there's a choice to make here: we can have language models of characters or language models of words, and we're going to see both of those in this video nugget.
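To make that concrete, here is a minimal Python sketch of the idea of a language model as a probability distribution over strings; the numbers are made up for illustration only, and a real model would of course be estimated from a corpus, as discussed next.

```python
# Toy "language model": a probability for each string.
# The values are illustrative only (and not normalized),
# not estimated from any data.
toy_model = {
    "Not to be invited is sad": 0.9,   # clearly English: near one
    "To not be invited is sad": 0.3,   # controversial: significantly less
}

def probability(sentence: str) -> float:
    """Return the model's probability for a sentence (0.0 if unknown)."""
    return toy_model.get(sentence, 0.0)

print(probability("Not to be invited is sad"))  # 0.9
```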
So the idea is that we try to derive the language model from a text corpus, which is essentially a large and structured set of texts. There's a whole area called corpus linguistics, which is basically about deriving things like language models, but also other things, from corpora. So corpora are used for statistical analysis: hypothesis testing, validating linguistic rules, and things like that. So we're going to presuppose that we have a text corpus; getting and curating those is a non-trivial task, of course. There are large text corpora: for instance, a large corpus of English newswire text is used in the Penn Treebank, which you may have heard of. In the German language, there's a corpus by the Bertelsmann company; they've basically collected all the newspapers they've published over the last 30 years, and so on. That's, for instance, being used to curate the words of the Brockhaus dictionary, which you may have heard about. So I'm going to first look into n-gram character models.
So remember, written texts are composed of characters, which are letters, digits, punctuation, spacing, and so on. So we can study language models for sequences of characters. And we're going to use something we've studied before, namely, we are going to look at this as a Markov-like process, and we're going to take over the notation we've developed for Markov processes, basically saying, well, the sequence c_1 up to c_N we're going to write as c_{1:N}. So we're going to call a character sequence of length n an n-gram, and for n equal to one, two, and three we're going to use unigram, bigram, and trigram as the traditional names. And we're just going to think of an n-gram model as a Markov chain of order n minus one. So for a trigram model, we get the task of predicting the next character: the probability of c_i given the observation of the previous characters is, in a trigram model, just P(c_i | c_{i-1}, c_{i-2}). So if we factor the probability of a particular length-N sequence with the chain rule and then use the Markov property, we're getting basically this big product: P(c_{1:N}) is basically the product over i of P(c_i | c_{i-2}, c_{i-1}). And the important thing to see here is that the only thing we need is actually this conditional probability. And to calculate that, we need one table entry per character triple, so the number of characters to the power of three. So for a trigram model for a language with 100 characters, which is about what you need if you count in digits, punctuation, spaces, and so on, we need 100³, that is, a million entries for that model.
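As a sketch of how such a trigram model can be estimated from a corpus and then used, here is some Python; the function names and the tiny example corpus are my own illustrative choices, and I'm glossing over smoothing and the treatment of the first two characters of a sequence.

```python
from collections import Counter

def train_trigram_model(text: str) -> dict:
    """Estimate P(c_i | c_{i-2}, c_{i-1}) by counting character trigrams."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    # Conditional probability = trigram count / count of its two-character prefix.
    return {t: trigrams[t] / bigrams[t[:2]] for t in trigrams}

def sequence_probability(model: dict, seq: str) -> float:
    """Probability of seq as the product of its trigram factors (Markov order 2).
    Unseen trigrams get probability 0 here; a real model would smooth."""
    p = 1.0
    for i in range(2, len(seq)):
        p *= model.get(seq[i - 2:i + 1], 0.0)
    return p

model = train_trigram_model("the cat sat on the mat. the cat ate. ")
print(sequence_probability(model, "the cat"))
```

Note that the table here only stores the trigrams that actually occur in the corpus, rather than all 100³ possible ones; that's the usual sparse representation of the million-entry model.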
And of course, you need a big corpus to estimate that, so you need a corpus with something like 10^7 characters. Okay, so that's doable: 10^7 characters is something like 10^4, so 10,000 pages, and that's easy to collect. So character models are something we can do relatively easily. So the question might be: what can we do with those? One of the things is language identification. You may want to know what natural language a text is written in; typically you would like to know, for instance, which lexicon to use or which language-specific processing to apply.
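And as a sketch of how this could work with the character models from above, here is some Python that identifies the language of a text by comparing its log-likelihood under per-language trigram models; it reuses the hypothetical train_trigram_model from the previous sketch, and the tiny training strings are stand-ins for real corpora.

```python
import math

def log_likelihood(model: dict, seq: str) -> float:
    """Sum of log trigram probabilities; the small floor stands in for smoothing."""
    return sum(math.log(model.get(seq[i - 2:i + 1], 1e-9))
               for i in range(2, len(seq)))

def identify_language(models: dict, text: str) -> str:
    """Return the language whose character model scores the text highest."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))

models = {
    "en": train_trigram_model("the quick brown fox jumps over the lazy dog. "),
    "de": train_trigram_model("der schnelle braune fuchs springt ueber den faulen hund. "),
}
print(identify_language(models, "the dog"))  # should print "en"
```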